housing_prices.csv data set and undertake an initial exploration of the data. You will find details on the data set on the relevant Kaggle pagehouse <- read_csv("data/housing_prices.csv")
library(GGally)
library(tidyverse)
glimpse(house)
## Rows: 19,675
## Columns: 10
## $ longitude <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.2…
## $ latitude <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37…
## $ housing_median_age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52,…
## $ total_rooms <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555,…
## $ total_bedrooms <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, …
## $ population <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 15…
## $ households <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, …
## $ median_income <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6…
## $ median_house_value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299…
## $ ocean_proximity <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NE…
total_rooms of houses to be strongly correlated with total_bedrooms. Use ggpairs() to investigate correlations between these two variables.ggpairs(house[,c("total_rooms", "total_bedrooms")])
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 200 rows containing missing values
## Warning: Removed 200 rows containing missing values (geom_point).
## Warning: Removed 200 rows containing non-finite values (stat_density).
total_bedrooms from the dataset, and use only total_rooms going forward.house_tidy <- house %>%
select(-total_bedrooms)
house_tidy
median_house_value of a house in terms of the possible predictor variables in the dataset.ggpairs() to investigate correlations between median_house_value and the predictors (this may take a while to run, don’t worry, make coffee or something).ggpairs(house_tidy)
median_income is strongly correlated with median_house_value. This is followed by a positive correlation between median_house_value and total_bedrooms and a negative correlation between median_house_value and latitude. Furthermore, The histograms of median_house_value split by ocean_proxmityalso show some variation.
ggplot visualisations of any significant correlations you find.house_tidy %>%
ggplot(aes(x = median_income, y = median_house_value)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
house_tidy %>%
ggplot(aes(x = total_rooms, y = median_house_value)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
house_tidy %>%
ggplot(aes(x = latitude, y = median_house_value)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
house_tidy %>%
ggplot(aes(x = ocean_proximity, y = median_house_value)) +
geom_boxplot()
ocean_proximity. Investigate the level of ocean_proximity predictors. How many dummy variables do you expect to get from it?house_tidy %>%
distinct(ocean_proximity)
# expect 4 dummies for ocean_proximity
median_house_value on median_income and check the regression diagnostics.mod1 <- lm(median_house_value ~ median_income, data = house_tidy)
par(mfrow = c(2,2))
plot(mod1)
summary(mod1)
##
## Call:
## lm(formula = median_house_value ~ median_income, data = house_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513966 -51564 -13979 36549 403935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45457.0 1359.0 33.45 <2e-16 ***
## median_income 39987.0 339.9 117.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74870 on 19673 degrees of freedom
## Multiple R-squared: 0.4129, Adjusted R-squared: 0.4129
## F-statistic: 1.384e+04 on 1 and 19673 DF, p-value: < 2.2e-16
# the residuals plots looks alright but there is some evidence to indicate skewness.
mod2_proximity <- lm(median_house_value ~ median_income + ocean_proximity, data = house_tidy)
summary(mod2_proximity)
##
## Call:
## lm(formula = median_house_value ~ median_income + ocean_proximity,
## data = house_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -479055 -41086 -11143 27912 376647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83581.2 1407.7 59.373 < 2e-16 ***
## median_income 35118.0 304.8 115.223 < 2e-16 ***
## ocean_proximityINLAND -72313.7 1100.7 -65.695 < 2e-16 ***
## ocean_proximityISLAND 200480.2 29232.9 6.858 7.19e-12 ***
## ocean_proximityNEAR BAY 16629.7 1591.9 10.446 < 2e-16 ***
## ocean_proximityNEAR OCEAN 15552.9 1500.8 10.363 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65340 on 19669 degrees of freedom
## Multiple R-squared: 0.5529, Adjusted R-squared: 0.5528
## F-statistic: 4865 on 5 and 19669 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mod2_proximity)
# The predictors median income and ocean_proximity explains 55% of the variation in our dataset. In addition, these predictors are also significant. For every increase in median income by 35,118, the median house price increases by 83,581.2 if all other predictors remain the same. If the house is on an island, the median house prices increases by 200,480.2 if all other predictors remain the same.
log(medium_income) and your chosen categorical predictor. Do you think this interaction term is statistically justified?mod3 <- lm(log(median_house_value) ~ log(median_income) + ocean_proximity + log(median_income):ocean_proximity, data = house_tidy)
summary(mod3)
##
## Call:
## lm(formula = log(median_house_value) ~ log(median_income) + ocean_proximity +
## log(median_income):ocean_proximity, data = house_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.26174 -0.21895 -0.02053 0.20040 2.18571
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 11.502680 0.011572 993.990
## log(median_income) 0.571391 0.008457 67.567
## ocean_proximityINLAND -0.697405 0.015989 -43.617
## ocean_proximityISLAND 2.032871 1.038587 1.957
## ocean_proximityNEAR BAY -0.053793 0.025064 -2.146
## ocean_proximityNEAR OCEAN -0.154993 0.023743 -6.528
## log(median_income):ocean_proximityINLAND 0.176592 0.012770 13.829
## log(median_income):ocean_proximityISLAND -1.277647 1.028706 -1.242
## log(median_income):ocean_proximityNEAR BAY 0.078022 0.018579 4.200
## log(median_income):ocean_proximityNEAR OCEAN 0.153048 0.018212 8.404
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## log(median_income) < 2e-16 ***
## ocean_proximityINLAND < 2e-16 ***
## ocean_proximityISLAND 0.0503 .
## ocean_proximityNEAR BAY 0.0319 *
## ocean_proximityNEAR OCEAN 6.83e-11 ***
## log(median_income):ocean_proximityINLAND < 2e-16 ***
## log(median_income):ocean_proximityISLAND 0.2143
## log(median_income):ocean_proximityNEAR BAY 2.69e-05 ***
## log(median_income):ocean_proximityNEAR OCEAN < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3345 on 19665 degrees of freedom
## Multiple R-squared: 0.6068, Adjusted R-squared: 0.6066
## F-statistic: 3372 on 9 and 19665 DF, p-value: < 2.2e-16
# obtain only a small improvement in r^2 from the interaction.
# we'll see shortly that the correct test would be an anova() comparing both models
anova(mod2_proximity, mod3)
## Warning in anova.lmlist(object, ...): models with response
## '"log(median_house_value)"' removed because response differs from model 1
# the small p-value suggests interaction is statistically significant, but the effect is small.
house_tidy %>%
ggplot(aes(x = log(median_income),
y = log(median_house_value),
colour = ocean_proximity)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ ocean_proximity)
## `geom_smooth()` using formula 'y ~ x'
# not much evidence that the gradient of the line varies significantly with ocean_proximity. Island so a change in the gradient however, there is not enough data points to make any conclusions.